AAAI.2022 - Speech and Natural Language Processing

Total: 151

#1 Pinpointing Fine-Grained Relationships between Hateful Tweets and Replies [PDF] [Copy] [Kimi]

Authors: Abdullah Albanyan ; Eduardo Blanco

Recent studies in the hate and counter hate domain have provided the grounds for investigating how to detect this pervasive content in social media. These studies mostly work with synthetic replies to hateful content written by annotators on demand rather than replies written by real users. We argue that working with naturally occurring replies to hateful content is key to study the problem. Building on this motivation, we create a corpus of 5,652 hateful tweets and replies. We analyze their fine-grained relationships by indicating whether the reply (a) is hate or counter hate speech, (b) provides a justification, (c) attacks the author of the tweet, and (d) adds additional hate. We also present linguistic insights into the language people use depending on these fine-grained relationships. Experimental results show improvements (a) taking into account the hateful tweet in addition to the reply and (b) pretraining with related tasks.

#2 Cross-Modal Coherence for Text-to-Image Retrieval [PDF] [Copy] [Kimi]

Authors: Malihe Alikhani ; Fangda Han ; Hareesh Ravi ; Mubbasir Kapadia ; Vladimir Pavlovic ; Matthew Stone

Common image-text joint understanding techniques presume that images and the associated text can universally be characterized by a single implicit model. However, co-occurring images and text can be related in qualitatively different ways, and explicitly modeling it could improve the performance of current joint understanding models. In this paper, we train a Cross-Modal Coherence Model for text-to-image retrieval task. Our analysis shows that models trained with image–text coherence relations can retrieve images originally paired with target text more often than coherence-agnostic models. We also show via human evaluation that images retrieved by the proposed coherence-aware model are preferred over a coherence-agnostic baseline by a huge margin. Our findings provide insights into the ways that different modalities communicate and the role of coherence relations in capturing commonsense inferences in text and imagery.

#3 Enhanced Story Comprehension for Large Language Models through Dynamic Document-Based Knowledge Graphs [PDF1] [Copy] [Kimi1]

Authors: Berkeley R Andrus ; Yeganeh Nasiri ; Shilong Cui ; Benjamin Cullen ; Nancy Fulda

Large transformer-based language models have achieved incredible success at various tasks which require narrative comprehension, including story completion, answering questions about stories, and generating stories ex nihilo. However, due to the limitations of finite context windows, these language models struggle to produce or understand stories longer than several thousand tokens. In order to mitigate the document length limitations that come with finite context windows, we introduce a novel architecture that augments story processing with an external dynamic knowledge graph. In contrast to static commonsense knowledge graphs which hold information about the real world, these dynamic knowledge graphs reflect facts extracted from the story being processed. Our architecture uses these knowledge graphs to create information-rich prompts which better facilitate story comprehension than prompts composed only of story text. We apply our architecture to the tasks of question answering and story completion. To complement this line of research, we introduce two long-form question answering tasks, LF-SQuAD and LF-QUOREF, in which the document length exceeds the size of the language model's context window, and introduce a story completion evaluation method that bypasses the stochastic nature of language model generation. We demonstrate broad improvement over typical prompt formulation methods for both question answering and story completion using GPT-2, GPT-3 and XLNet.

#4 Diagnostics-Guided Explanation Generation [PDF] [Copy] [Kimi]

Authors: Pepa Atanasova ; Jakob Grue Simonsen ; Christina Lioma ; Isabelle Augenstein

Explanations shed light on a machine learning model's rationales and can aid in identifying deficiencies in its reasoning process. Explanation generation models are typically trained in a supervised way given human explanations. When such annotations are not available, explanations are often selected as those portions of the input that maximise a downstream task's performance, which corresponds to optimising an explanation's Faithfulness to a given model. Faithfulness is one of several so-called diagnostic properties, which prior work has identified as useful for gauging the quality of an explanation without requiring annotations. Other diagnostic properties are Data Consistency, which measures how similar explanations are for similar input instances, and Confidence Indication, which shows whether the explanation reflects the confidence of the model. In this work, we show how to directly optimise for these diagnostic properties when training a model to generate sentence-level explanations, which markedly improves explanation quality, agreement with human rationales, and downstream task performance on three complex reasoning tasks.

#5 Mitigating Reporting Bias in Semi-supervised Temporal Commonsense Inference with Probabilistic Soft Logic [PDF] [Copy] [Kimi]

Authors: Bibo Cai ; Xiao Ding ; Bowen Chen ; Li Du ; Ting Liu

Acquiring high-quality temporal common sense (TCS) knowledge from free-form text is a crucial but challenging problem for event-centric natural language understanding, due to the language reporting bias problem: people rarely report the commonly observed events but highlight the special cases. For example, one may rarely report "I get up from bed in 1 minute", but we can observe "It takes me an hour to get up from bed every morning'' in text. Models directly trained upon such corpus would capture distorted TCS knowledge, which could influence the model performance. Prior work addresses this issue mainly by exploiting the interactions among temporal dimensions (e.g., duration, temporal relation between events) in a multi-task view. However, this line of work suffers the limitation of implicit, inadequate and unexplainable interactions modeling. In this paper, we propose a novel neural-logic based Soft Logic Enhanced Event Temporal Reasoning (SLEER) model for acquiring unbiased TCS knowledge, in which the complementary relationship among dimensions are explicitly represented as logic rules and modeled by t-norm fuzzy logics. SLEER can utilize logic rules to regularize its inference process. Experimental results on four intrinsic evaluation datasets and two extrinsic datasets show the efficiency of our proposed method.

#6 Adversarial Training for Improving Model Robustness? Look at Both Prediction and Interpretation [PDF] [Copy] [Kimi]

Authors: Hanjie Chen ; Yangfeng Ji

Neural language models show vulnerability to adversarial examples which are semantically similar to their original counterparts with a few words replaced by their synonyms. A common way to improve model robustness is adversarial training which follows two steps—collecting adversarial examples by attacking a target model, and fine-tuning the model on the augmented dataset with these adversarial examples. The objective of traditional adversarial training is to make a model produce the same correct predictions on an original/adversarial example pair. However, the consistency between model decision-makings on two similar texts is ignored. We argue that a robust model should behave consistently on original/adversarial example pairs, that is making the same predictions (what) based on the same reasons (how) which can be reflected by consistent interpretations. In this work, we propose a novel feature-level adversarial training method named FLAT. FLAT aims at improving model robustness in terms of both predictions and interpretations. FLAT incorporates variational word masks in neural networks to learn global word importance and play as a bottleneck teaching the model to make predictions based on important words. FLAT explicitly shoots at the vulnerability problem caused by the mismatch between model understandings on the replaced words and their synonyms in original/adversarial example pairs by regularizing the corresponding global word importance scores. Experiments show the effectiveness of FLAT in improving the robustness with respect to both predictions and interpretations of four neural network models (LSTM, CNN, BERT, and DeBERTa) to two adversarial attacks on four text classification tasks. The models trained via FLAT also show better robustness than baseline models on unforeseen adversarial examples across different attacks.

#7 Unsupervised Editing for Counterfactual Stories [PDF] [Copy] [Kimi]

Authors: Jiangjie Chen ; Chun Gan ; Sijie Cheng ; Hao Zhou ; Yanghua Xiao ; Lei Li

Creating what-if stories requires reasoning about prior statements and possible outcomes of the changed conditions. One can easily generate coherent endings under new conditions, but it would be challenging for current systems to do it with minimal changes to the original story. Therefore, one major challenge is the trade-off between generating a logical story and rewriting with minimal-edits. In this paper, we propose EDUCAT, an editing-based unsupervised approach for counterfactual story rewriting. EDUCAT includes a target position detection strategy based on estimating causal effects of the what-if conditions, which keeps the causal invariant parts of the story. EDUCAT then generates the stories under fluency, coherence and minimal-edits constraints. We also propose a new metric to alleviate the shortcomings of current automatic metrics and better evaluate the trade-off. We evaluate EDUCAT on a public counterfactual story rewriting benchmark. Experiments show that EDUCAT achieves the best trade-off over unsupervised SOTA methods according to both automatic and human evaluation. The resources of EDUCAT are available at: https://github.com/jiangjiechen/EDUCAT.

#8 LOREN: Logic-Regularized Reasoning for Interpretable Fact Verification [PDF] [Copy] [Kimi]

Authors: Jiangjie Chen ; Qiaoben Bao ; Changzhi Sun ; Xinbo Zhang ; Jiaze Chen ; Hao Zhou ; Yanghua Xiao ; Lei Li

Given a natural language statement, how to verify its veracity against a large-scale textual knowledge source like Wikipedia? Most existing neural models make predictions without giving clues about which part of a false claim goes wrong. In this paper, we propose LOREN, an approach for interpretable fact verification. We decompose the verification of the whole claim at phrase-level, where the veracity of the phrases serves as explanations and can be aggregated into the final verdict according to logical rules. The key insight of LOREN is to represent claim phrase veracity as three-valued latent variables, which are regularized by aggregation logical rules. The final claim verification is based on all latent variables. Thus, LOREN enjoys the additional benefit of interpretability --- it is easy to explain how it reaches certain results with claim phrase veracity. Experiments on a public fact verification benchmark show that LOREN is competitive against previous approaches while enjoying the merit of faithful and accurate interpretability. The resources of LOREN are available at: https://github.com/jiangjiechen/LOREN.

#9 ContrastNet: A Contrastive Learning Framework for Few-Shot Text Classification [PDF] [Copy] [Kimi]

Authors: Junfan Chen ; Richong Zhang ; Yongyi Mao ; Jie Xu

Few-shot text classification has recently been promoted by the meta-learning paradigm which aims to identify target classes with knowledge transferred from source classes with sets of small tasks named episodes. Despite their success, existing works building their meta-learner based on Prototypical Networks are unsatisfactory in learning discriminative text representations between similar classes, which may lead to contradictions during label prediction. In addition, the task-level and instance-level overfitting problems in few-shot text classification caused by a few training examples are not sufficiently tackled. In this work, we propose a contrastive learning framework named ContrastNet to tackle both discriminative representation and overfitting problems in few-shot text classification. ContrastNet learns to pull closer text representations belonging to the same class and push away text representations belonging to different classes, while simultaneously introducing unsupervised contrastive regularization at both task-level and instance-level to prevent overfitting. Experiments on 8 few-shot text classification datasets show that ContrastNet outperforms the current state-of-the-art models.

#10 From Good to Best: Two-Stage Training for Cross-Lingual Machine Reading Comprehension [PDF] [Copy] [Kimi]

Authors: Nuo Chen ; Linjun Shou ; Ming Gong ; Jian Pei

Cross-lingual Machine Reading Comprehension (xMRC) is a challenging task due to the lack of training data in low-resource languages. Recent approaches use training data only in a resource-rich language (such as English) to fine-tune large-scale cross-lingual pre-trained language models, which transfer knowledge from resource-rich languages (source) to low-resource languages (target). Due to the big difference between languages, the model fine-tuned only by the source language may not perform well for target languages. In our study, we make an interesting observation that while the top 1 result predicted by the previous approaches may often fail to hit the ground-truth answer, there are still good chances for the correct answer to be contained in the set of top k predicted results. Intuitively, the previous approaches have empowered the model certain level of capability to roughly distinguish good answers from bad ones. However, without sufficient training data, it is not powerful enough to capture the nuances between the accurate answer and those approximate ones. Based on this observation, we develop a two-stage approach to enhance the model performance. The first stage targets at recall; we design a hard-learning (HL) algorithm to maximize the likelihood that the top k predictions contain the accurate answer. The second stage focuses on precision, where an answer-aware contrastive learning (AA-CL) mechanism is developed to learn the minute difference between the accurate answer and other candidates. Extensive experiments show that our model significantly outperforms strong baselines on two cross-lingual MRC benchmark datasets.

#11 Probing Linguistic Information for Logical Inference in Pre-trained Language Models [PDF] [Copy] [Kimi]

Authors: Zeming Chen ; Qiyue Gao

Progress in pre-trained language models has led to a surge of impressive results on downstream tasks for natural language understanding. Recent work on probing pre-trained language models uncovered a wide range of linguistic properties encoded in their contextualized representations. However, it is unclear whether they encode semantic knowledge that is crucial to symbolic inference methods. We propose a methodology for probing knowledge for inference that logical systems require but often lack in pre-trained language model representations. Our probing datasets cover a list of key types of knowledge used by many symbolic inference systems. We find that (i) pre-trained language models do encode several types of knowledge for inference, but there are also some types of knowledge for inference that are not encoded, (ii) language models can effectively learn missing knowledge for inference through fine-tuning. Overall, our findings provide insights into which aspects of knowledge for inference language models and their pre-training procedures capture. Moreover, we have demonstrated language models' potential as semantic and background knowledge bases for supporting symbolic inference methods.

#12 On the Transferability of Pre-trained Language Models: A Study from Artificial Datasets [PDF] [Copy] [Kimi]

Authors: Cheng-Han Chiang ; Hung-yi Lee

Pre-training language models (LMs) on large-scale unlabeled text data makes the model much easier to achieve exceptional downstream performance than their counterparts directly trained on the downstream tasks. In this work, we study what specific traits in the pre-training data, other than the semantics, make a pre-trained LM superior to their counterparts trained from scratch on downstream tasks. We propose to use artificially constructed datasets as the pre-training data to exclude the effect of semantics, and further control what characteristics the pre-training corpora have. By fine-tuning the pre-trained models on GLUE benchmark, we can learn how beneficial it is to transfer the knowledge from the model trained on the dataset possessing that specific trait. We define and discuss three different characteristics in the artificial dataset: 1) matching the token's uni-gram or bi-gram distribution between pre-training and downstream fine-tuning, 2) the presence of the explicit dependencies among the tokens in a sequence, 3) the length of the implicit dependencies among the tokens in a sequence. Our experiments show that the explicit dependencies in the sequences of the pre-training data are critical to the downstream performance. Our results also reveal that models achieve better downstream performance when pre-trained on a dataset with a longer range of implicit dependencies. Based on our analysis, we find that models pre-trained with artificial datasets are prone to learn spurious correlation in downstream tasks. Our work reveals that even if the LMs are not pre-trained on natural language, they still gain transferability on certain human language downstream tasks once the LMs learn to model the token dependencies in the sequences. This result helps us understand the exceptional transferability of pre-trained LMs.

#13 C2L: Causally Contrastive Learning for Robust Text Classification [PDF] [Copy] [Kimi]

Authors: Seungtaek Choi ; Myeongho Jeong ; Hojae Han ; Seung-won Hwang

Despite the super-human accuracy of recent deep models in NLP tasks, their robustness is reportedly limited due to their reliance on spurious patterns. We thus aim to leverage contrastive learning and counterfactual augmentation for robustness. For augmentation, existing work either requires humans to add counterfactuals to the dataset or machines to automatically matches near-counterfactuals already in the dataset. Unlike existing augmentation is affected by spurious correlations, ours, by synthesizing “a set” of counterfactuals, and making a collective decision on the distribution of predictions on this set, can robustly supervise the causality of each term. Our empirical results show that our approach, by collective decisions, is less sensitive to task model bias of attribution-based synthesis, and thus achieves significant improvements, in diverse dimensions: 1) counterfactual robustness, 2) cross-domain generalization, and 3) generalization from scarce data.

#14 Novelty Controlled Paraphrase Generation with Retrieval Augmented Conditional Prompt Tuning [PDF] [Copy] [Kimi]

Authors: Jishnu Ray Chowdhury ; Yong Zhuang ; Shuyi Wang

Paraphrase generation is a fundamental and long-standing task in natural language processing. In this paper, we concentrate on two contributions to the task: (1) we propose Retrieval Augmented Prompt Tuning (RAPT) as a parameter-efficient method to adapt large pre-trained language models for paraphrase generation; (2) we propose Novelty Conditioned RAPT (NC-RAPT) as a simple model-agnostic method of using specialized prompt tokens for controlled paraphrase generation with varying levels of lexical novelty. By conducting extensive experiments on four datasets, we demonstrate the effectiveness of the proposed approaches for retaining the semantic content of the original text while inducing lexical novelty in the generation.

#15 Flexible Instance-Specific Rationalization of NLP Models [PDF] [Copy] [Kimi]

Authors: George Chrysostomou ; Nikolaos Aletras

Recent research on model interpretability in natural language processing extensively uses feature scoring methods for identifying which parts of the input are the most important for a model to make a prediction (i.e. explanation or rationale). However, previous research has shown that there is no clear best scoring method across various text classification tasks while practitioners typically have to make several other ad-hoc choices regarding the length and the type of the rationale (e.g. short or long, contiguous or not). Inspired by this, we propose a simple yet effective and flexible method that allows selecting optimally for each data instance: (1) a feature scoring method; (2) the length; and (3) the type of the rationale. Our method is inspired by input erasure approaches to interpretability which assume that the most faithful rationale for a prediction should be the one with the highest difference between the model's output distribution using the full text and the text after removing the rationale as input respectively. Evaluation on four standard text classification datasets shows that our proposed method provides more faithful, comprehensive and highly sufficient explanations compared to using a fixed feature scoring method, rationale length and type. More importantly, we demonstrate that a practitioner is not required to make any ad-hoc choices in order to extract faithful rationales using our approach.

#16 InfoLM: A New Metric to Evaluate Summarization & Data2Text Generation [PDF] [Copy] [Kimi]

Authors: Pierre Jean A. Colombo ; Chloé Clavel ; Pablo Piantanida

Assessing the quality of natural language generation (NLG) systems through human annotation is very expensive. Additionally, human annotation campaigns are time-consuming and include non-reusable human labour. In practice, researchers rely on automatic metrics as a proxy of quality. In the last decade, many string-based metrics (e.g., BLEU or ROUGE) have been introduced. However, such metrics usually rely on exact matches and thus, do not robustly handle synonyms. In this paper, we introduce InfoLM a family of untrained metrics that can be viewed as a string-based metric that addresses the aforementioned flaws thanks to a pre-trained masked language model. This family of metrics also makes use of information measures allowing the possibility to adapt InfoLM to different evaluation criteria. Using direct assessment, we demonstrate that InfoLM achieves statistically significant improvement and two figure correlation gains in many configurations compared to existing metrics on both summarization and data2text generation tasks.

#17 Nice Perfume. How Long Did You Marinate in It? Multimodal Sarcasm Explanation [PDF] [Copy] [Kimi]

Authors: Poorav Desai ; Tanmoy Chakraborty ; Md Shad Akhtar

Sarcasm is a pervading linguistic phenomenon and highly challenging to explain due to its subjectivity, lack of context and deeply-felt opinion. In the multimodal setup, sarcasm is conveyed through the incongruity between the text and visual entities. Although recent approaches deal with sarcasm as a classification problem, it is unclear why an online post is identified as sarcastic. Without proper explanation, end users may not be able to perceive the underlying sense of irony. In this paper, we propose a novel problem -- Multimodal Sarcasm Explanation (MuSE) -- given a multimodal sarcastic post containing an image and a caption, we aim to generate a natural language explanation to reveal the intended sarcasm. To this end, we develop MORE, a new dataset with explanation of 3510 sarcastic multimodal posts. Each explanation is a natural language (English) sentence describing the hidden irony. We benchmark MORE by employing a multimodal Transformer-based architecture. It incorporates a cross-modal attention in the Transformer's encoder which attends to the distinguishing features between the two modalities. Subsequently, a BART-based auto-regressive decoder is used as the generator. Empirical results demonstrate convincing results over various baselines (adopted for MuSE) across five evaluation metrics. We also conduct human evaluation on predictions and obtain Fleiss' Kappa score of 0.4 as a fair agreement among 25 evaluators.

#18 Zero-Shot Commonsense Question Answering with Cloze Translation and Consistency Optimization [PDF] [Copy] [Kimi]

Authors: Zi-Yi Dou ; Nanyun Peng

Commonsense question answering (CQA) aims to test if models can answer questions regarding commonsense knowledge that everyone knows. Prior works that incorporate external knowledge bases have shown promising results, but knowledge bases are expensive to construct and are often limited to a fixed set of relations. In this paper, we instead focus on better utilizing the implicit knowledge stored in pre-trained language models. While researchers have found that the knowledge embedded in pre-trained language models can be extracted by having them fill in the blanks of carefully designed prompts for relation extraction and text classification, it remains unclear if we can adopt this paradigm in CQA where the inputs and outputs take much more flexible forms. To this end, we investigate four translation methods that can translate natural questions into cloze-style sentences to better solicit commonsense knowledge from language models, including a syntactic-based model, an unsupervised neural model, and two supervised neural models. In addition, to combine the different translation methods, we propose to encourage consistency among model predictions on different translated questions with unlabeled data. We demonstrate the effectiveness of our methods on three CQA datasets in zero-shot settings. We show that our methods are complementary to a knowledge base improved model, and combining them can lead to state-of-the-art zero-shot performance. Analyses also reveal distinct characteristics of the different cloze translation methods and provide insights on why combining them can lead to great improvements. Code/dataset is available at https://github.com/PlusLabNLP/zero_shot_cqa.

#19 Synthetic Disinformation Attacks on Automated Fact Verification Systems [PDF] [Copy] [Kimi]

Authors: Yibing Du ; Antoine Bosselut ; Christopher D. Manning

Automated fact-checking is a needed technology to curtail the spread of online misinformation. One current framework for such solutions proposes to verify claims by retrieving supporting or refuting evidence from related textual sources. However, the realistic use cases for fact-checkers will require verifying claims against evidence sources that could be affected by the same misinformation. Furthermore, the development of modern NLP tools that can produce coherent, fabricated content would allow malicious actors to systematically generate adversarial disinformation for fact-checkers. In this work, we explore the sensitivity of automated fact-checkers to synthetic adversarial evidence in two simulated settings: ADVERSARIAL ADDITION, where we fabricate documents and add them to the evidence repository available to the fact-checking system, and ADVERSARIAL MODIFICATION, where existing evidence source documents in the repository are automatically altered. Our study across multiple models on three benchmarks demonstrates that these systems suffer significant performance drops against these attacks. Finally, we discuss the growing threat of modern NLG systems as generators of disinformation in the context of the challenges they pose to automated fact-checkers.

#20 Regularizing End-to-End Speech Translation with Triangular Decomposition Agreement [PDF] [Copy] [Kimi]

Authors: Yichao Du ; Zhirui Zhang ; Weizhi Wang ; Boxing Chen ; Jun Xie ; Tong Xu

End-to-end speech-to-text translation (E2E-ST) is becoming increasingly popular due to the potential of its less error propagation, lower latency, and fewer parameters. Given the triplet training corpus〈speech, transcription, translation〉, the conventional high-quality E2E-ST system leverages the〈speech, transcription〉pair to pre-train the model and then utilizes the〈speech, translation〉pair to optimize it further. However, this process only involves two-tuple data at each stage, and this loose coupling fails to fully exploit the association between triplet data. In this paper, we attempt to model the joint probability of transcription and translation based on the speech input to directly leverage such triplet data. Based on that, we propose a novel regularization method for model training to improve the agreement of dual-path decomposition within triplet data, which should be equal in theory. To achieve this goal, we introduce two Kullback-Leibler divergence regularization terms into the model training objective to reduce the mismatch between output probabilities of dual-path. Then the well-trained model can be naturally transformed as the E2E-ST models by a pre-defined early stop tag. Experiments on the MuST-C benchmark demonstrate that our proposed approach significantly outperforms state-of-the-art E2E-ST baselines on all 8 language pairs while achieving better performance in the automatic speech recognition task.

#21 Play the Shannon Game with Language Models: A Human-Free Approach to Summary Evaluation [PDF] [Copy] [Kimi]

Authors: Nicholas Egan ; Oleg Vasilyev ; John Bohannon

The goal of a summary is to concisely state the most important information in a document. With this principle in mind, we introduce new reference-free summary evaluation metrics that use a pretrained language model to estimate the information content shared between a document and its summary. These metrics are a modern take on the Shannon Game, a method for summary quality scoring proposed decades ago, where we replace human annotators with language models. We also view these metrics as an extension of BLANC, a recently proposed approach to summary quality measurement based on the performance of a language model with and without the help of a summary. Using transformer based language models, we empirically verify that our metrics achieve state-of-the-art correlation with human judgement of the summary quality dimensions of both coherence and relevance, as well as competitive correlation with human judgement of consistency and fluency.

#22 Fortunately, Discourse Markers Can Enhance Language Models for Sentiment Analysis [PDF] [Copy] [Kimi]

Authors: Liat Ein-Dor ; Ilya Shnayderman ; Artem Spector ; Lena Dankin ; Ranit Aharonov ; Noam Slonim

In recent years, pretrained language models have revolutionized the NLP world, while achieving state of the art performance in various downstream tasks. However, in many cases, these models do not perform well when labeled data is scarce and the model is expected to perform in the zero or few shot setting. Recently, several works have shown that continual pretraining or performing a second phase of pretraining (inter-training) which is better aligned with the downstream task, can lead to improved results, especially in the scarce data setting. Here, we propose to leverage sentiment-carrying discourse markers to generate large-scale weakly-labeled data, which in turn can be used to adapt language models for sentiment analysis. Extensive experimental results show the value of our approach on various benchmark datasets, including the finance domain. Code, models and data are available at https://github.com/ibm/tslm-discourse-markers.

#23 Retrieve, Caption, Generate: Visual Grounding for Enhancing Commonsense in Text Generation Models [PDF] [Copy] [Kimi]

Authors: Steven Y. Feng ; Kevin Lu ; Zhuofu Tao ; Malihe Alikhani ; Teruko Mitamura ; Eduard Hovy ; Varun Gangal

We investigate the use of multimodal information contained in images as an effective method for enhancing the commonsense of Transformer models for text generation. We perform experiments using BART and T5 on concept-to-text generation, specifically the task of generative commonsense reasoning, or CommonGen. We call our approach VisCTG: Visually Grounded Concept-to-Text Generation. VisCTG involves captioning images representing appropriate everyday scenarios, and using these captions to enrich and steer the generation process. Comprehensive evaluation and analysis demonstrate that VisCTG noticeably improves model performance while successfully addressing several issues of the baseline generations, including poor commonsense, fluency, and specificity.

#24 Language Model Priming for Cross-Lingual Event Extraction [PDF] [Copy] [Kimi]

Authors: Steven Fincke ; Shantanu Agarwal ; Scott Miller ; Elizabeth Boschee

We present a novel, language-agnostic approach to "priming" language models for the task of event extraction, providing particularly effective performance in low-resource and zero-shot cross-lingual settings. With priming, we augment the input to the transformer stack's language model differently depending on the question(s) being asked of the model at runtime. For instance, if the model is being asked to identify arguments for the trigger "protested", we will provide that trigger as part of the input to the language model, allowing it to produce different representations for candidate arguments than when it is asked about arguments for the trigger "arrest" elsewhere in the same sentence. We show that by enabling the language model to better compensate for the deficits of sparse and noisy training data, our approach improves both trigger and argument detection and classification significantly over the state of the art in a zero-shot cross-lingual setting.

#25 Language Modelling via Learning to Rank [PDF] [Copy] [Kimi]

Authors: Arvid Frydenlund ; Gagandeep Singh ; Frank Rudzicz

We consider language modelling (LM) as a multi-label structured prediction task by re-framing training from solely predicting a single ground-truth word to ranking a set of words which could continue a given context. To avoid annotating top-k ranks, we generate them using pre-trained LMs: GPT-2, BERT, and Born-Again models. This leads to a rank-based form of knowledge distillation (KD). We also develop a method using N-grams to create a non-probabilistic teacher which generates the ranks without the need of a pre-trained LM. We confirm the hypotheses: that we can treat LMing as a ranking task and that we can do so without the use of a pre-trained LM. We show that rank-based KD generally gives a modest improvement to perplexity (PPL) -- though often with statistical significance -- when compared to Kullback–Leibler-based KD. Surprisingly, given the naivety of the method, the N-grams act as competitive teachers and achieve similar performance as using either BERT or a Born-Again model teachers. Unsurprisingly, GPT-2 always acts as the best teacher. Using it and a Transformer-XL student on Wiki-02, rank-based KD reduces a cross-entropy baseline from 65.27 to 55.94 and against a KL-based KD of 56.70.